
    Foundational principles for large scale inference: Illustrations through correlation mining

    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number $n$ of acquired samples (statistical replicates) is far smaller than the number $p$ of observed variables (genes, neurons, voxels, or chemical constituents). Much recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity, however, has received relatively less attention, especially in the setting where the sample size $n$ is fixed and the dimension $p$ grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime, where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime, where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime, where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche, but only the last applies to exa-scale data dimensions. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that is of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
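    As a minimal numerical sketch (not from the paper), the sample-starved regime is easy to simulate: with $n$ fixed and $p$ large, even mutually independent variables produce spuriously large sample correlations, which is why the sample complexity of correlation screening matters. The threshold and sizes below are illustrative choices, not the authors'.

```python
import numpy as np

# Toy "sample-starved" regime: n samples, p >> n independent variables.
rng = np.random.default_rng(1)
n, p = 20, 2000
X = rng.normal(size=(n, p))

# Sample correlation matrix from standardized columns.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = (Z.T @ Z) / (n - 1)

# Correlation "mining": screen for variable pairs whose sample correlation
# exceeds a threshold rho. Here the null holds (no true correlation),
# so every hit is a false discovery.
rho = 0.8
hits = np.argwhere(np.triu(np.abs(R) >= rho, k=1))
print(f"{len(hits)} of {p*(p-1)//2} pairs exceed |r| >= {rho} "
      "even though all true correlations are zero")
```

    Holding $n$ fixed and increasing $p$ inflates the count of spurious hits, matching the purely high dimensional regime the abstract singles out.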

    The Hoffmann-Jorgensen inequality in metric semigroups

    We prove a refinement of the Hoffmann-Jorgensen inequality that is significant for three reasons. First, our result improves on the state of the art even for real-valued random variables. Second, the result unifies several versions in the Banach space literature, including those by Johnson and Schechtman [Ann. Probab. 17 (1989)], Klass and Nowicki [Ann. Probab. 28 (2000)], and Hitczenko and Montgomery-Smith [Ann. Probab. 29 (2001)]. Finally, we show that the Hoffmann-Jorgensen inequality (including our generalized version) holds not only in Banach spaces but, more generally, in the most primitive mathematical framework required to state the inequality: a metric semigroup $\mathscr{G}$. This includes normed linear spaces as well as all compact, discrete, or (connected) abelian Lie groups. Comment: 11 pages, published in the Annals of Probability. The Introduction section shares motivating examples with arXiv:1506.02605.
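    For orientation, one standard Banach-space formulation of the classical Hoffmann-Jorgensen inequality (following Ledoux and Talagrand's Probability in Banach Spaces; exact constants vary across the literature, and the paper's refinement is stronger and not reproduced here) reads:

```latex
% X_1, ..., X_n independent random vectors in a Banach space,
% partial sums S_k = X_1 + ... + X_k; then for all s, t > 0:
\[
\mathbb{P}\Bigl(\max_{k \le n} \|S_k\| > 3t + s\Bigr)
  \;\le\;
  \Bigl(\mathbb{P}\bigl(\max_{k \le n} \|S_k\| > t\bigr)\Bigr)^{2}
  + \mathbb{P}\Bigl(\max_{i \le n} \|X_i\| > s\Bigr).
\]
```

    The point of the paper is that a statement of this shape needs only a metric and an associative operation, so it makes sense, and holds, in a metric semigroup $\mathscr{G}$.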

    Retaining positive definiteness in thresholded matrices

    Positive definite (p.d.) matrices arise naturally in many areas of mathematics and also feature extensively in scientific applications. In modern high-dimensional applications, a common approach to finding sparse positive definite matrices is to threshold their small off-diagonal elements. This thresholding, sometimes referred to as hard-thresholding, sets small elements to zero. Thresholding has the attractive property that the resulting matrices are sparse, and are thus easier to interpret and work with. In many applications, it is often required, and thus implicitly assumed, that thresholded matrices retain positive definiteness. In this paper we formally investigate the algebraic properties of p.d. matrices that are thresholded. We demonstrate that for positive definiteness to be preserved, the pattern of elements set to zero must correspond to a graph which is a union of disconnected complete components. This result rigorously demonstrates that, except in special cases, positive definiteness can be easily lost. We then proceed to demonstrate that the class of diagonally dominant matrices is not maximal in terms of retaining positive definiteness when thresholded. Consequently, we derive characterizations of matrices which retain positive definiteness when thresholded with respect to important classes of graphs. In particular, we demonstrate that retaining positive definiteness upon thresholding is governed by complex algebraic conditions.
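    A minimal numerical sketch (not from the paper) of how easily positive definiteness is lost: hard-thresholding the $3 \times 3$ p.d. matrix below leaves a zero pattern corresponding to the path on three vertices, which is not a union of disconnected complete components, and the smallest eigenvalue goes negative.

```python
import numpy as np

# A 3x3 positive definite matrix (all leading minors, hence eigenvalues, > 0).
A = np.array([[1.0, 0.8, 0.5],
              [0.8, 1.0, 0.8],
              [0.5, 0.8, 1.0]])
print(np.linalg.eigvalsh(A))  # all positive

# Hard-threshold: zero out off-diagonal entries below tau in magnitude.
tau = 0.6
B = np.where((np.abs(A) >= tau) | np.eye(3, dtype=bool), A, 0.0)
print(np.linalg.eigvalsh(B))  # smallest eigenvalue is 1 - 0.8*sqrt(2) < 0
```

    The surviving edges 1-2 and 2-3 (with 1-3 removed) form a path, so by the paper's characterization there is no guarantee that positive definiteness survives, and here it does not.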

    Integration and measures on the space of countable labelled graphs

    In this paper we develop a rigorous foundation for the study of integration and measures on the space $\mathscr{G}(V)$ of all graphs defined on a countable labelled vertex set $V$. We first study several interrelated $\sigma$-algebras and a large family of probability measures on graph space. We then focus on a "dyadic" Hamming distance function $\left\| \cdot \right\|_{\psi,2}$, which was very useful in the study of differentiation on $\mathscr{G}(V)$. The function $\left\| \cdot \right\|_{\psi,2}$ is shown to be a Haar measure-preserving bijection from the subset of infinite graphs to the circle (with the Haar/Lebesgue measure), thereby naturally identifying the two spaces. As a consequence, we establish a "change of variables" formula that enables the transfer of the Riemann-Lebesgue theory on $\mathbb{R}$ to graph space $\mathscr{G}(V)$. This also complements previous work in which a theory of Newton-Leibniz differentiation was transferred from the real line to $\mathscr{G}(V)$ for countable $V$. Finally, we identify the Pontryagin dual of $\mathscr{G}(V)$, and characterize the positive definite functions on $\mathscr{G}(V)$. Comment: 15 pages, LaTeX.
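    A toy Python sketch of the dyadic idea, under assumptions: the pair enumeration and the function names (pair_enumeration, dyadic_point) are illustrative choices, not the paper's construction. The idea is that a graph's edge indicators, read along a fixed enumeration of vertex pairs, become the binary digits of a point in [0, 1); the paper's map $\left\| \cdot \right\|_{\psi,2}$ makes such an identification rigorous and Haar measure-preserving on the set of infinite graphs.

```python
import itertools

def pair_enumeration(n_pairs):
    """First n_pairs pairs (i, j) with i < j, in a fixed diagonal order."""
    pairs = ((i, j) for j in itertools.count(1) for i in range(j))
    return list(itertools.islice(pairs, n_pairs))

def dyadic_point(edges, n_pairs=64):
    """Map a graph's edge set to a truncated dyadic expansion in [0, 1)."""
    edge_set = {tuple(sorted(e)) for e in edges}
    return sum(2.0 ** -(k + 1)
               for k, pq in enumerate(pair_enumeration(n_pairs))
               if pq in edge_set)

# A triangle on vertices {0, 1, 2} occupies the first three digit slots,
# so it maps to 1/2 + 1/4 + 1/8 = 0.875.
print(dyadic_point([(0, 1), (0, 2), (1, 2)]))
```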

    The Khinchin-Kahane and Levy inequalities for abelian metric groups, and transfer from normed (abelian semi)groups to Banach spaces

    The Khinchin-Kahane inequality is a fundamental result in the probability literature, with the most general version to date holding in Banach spaces. Motivated by modern settings and applications, we generalize this inequality to arbitrary abelian metric groups. If, instead of assuming the group is abelian, one assumes its metric to be a norm (i.e., $\mathbb{Z}_{>0}$-homogeneous), then we explain how the inequality improves to the same one as in Banach spaces. This occurs via a "transfer principle" that helps carry over questions involving normed metric groups and abelian normed semigroups into the Banach space framework. This principle also extends the notion of the expectation to random variables with values in arbitrary abelian normed metric semigroups $\mathscr{G}$. We provide additional applications, including studying weakly $\ell_p$ $\mathscr{G}$-valued sequences and related Rademacher series. On a related note, we also formulate a "general" Levy inequality, with two features: (i) it subsumes several known variants in the Banach space literature; and (ii) we show the inequality in the minimal framework required to state it: abelian metric groups. Comment: 15 pages, Introduction section shares motivating examples with arXiv:1506.02605. Significant revisions to the exposition. Final version, to appear in the Journal of Mathematical Analysis and Applications.
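    For orientation, the classical Banach-space statement that the paper generalizes (the Khinchin-Kahane, or Kahane, inequality) can be written as follows; the metric-group version is not reproduced here.

```latex
% (X, ||.||) a Banach space, x_1, ..., x_n in X,
% eps_1, ..., eps_n i.i.d. Rademacher signs, 0 < p < q < infinity.
% There is a constant C_{p,q}, depending only on p and q, such that:
\[
\Bigl(\mathbb{E}\,\Bigl\|\sum_{i=1}^{n} \varepsilon_i x_i\Bigr\|^{q}\Bigr)^{1/q}
  \;\le\; C_{p,q}\,
\Bigl(\mathbb{E}\,\Bigl\|\sum_{i=1}^{n} \varepsilon_i x_i\Bigr\|^{p}\Bigr)^{1/p}.
\]
```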

    A Methodology for Robust Multiproxy Paleoclimate Reconstructions and Modeling of Temperature Conditional Quantiles

    Great strides have been made in the field of reconstructing past temperatures from models relating temperature to temperature-sensitive paleoclimate proxies. One of the goals of such reconstructions is to assess whether the current climate is anomalous in a millennial context. These regression-based approaches model the conditional mean of the temperature distribution as a function of paleoclimate proxies (or vice versa). Some recent work in the area has considered methods that help reduce the uncertainty inherent in such statistical paleoclimate reconstructions, with the ultimate goal of improving the confidence that can be attached to such endeavors. A second important scientific focus in the subject is forward models for proxies, whose goal is to understand how paleoclimate proxies are driven by temperature and other environmental variables. In this paper we introduce novel statistical methodology for (1) quantile regression with autoregressive residual structure, (2) estimation of the corresponding model parameters, and (3) development of a rigorous framework for specifying uncertainty estimates of quantities of interest, yielding (4) statistical byproducts that address the two scientific foci discussed above. Our statistical methodology demonstrably produces a more robust reconstruction than is possible with conditional-mean-fitting methods. Our reconstruction shares some common features with past reconstructions, but also yields useful new insights. More importantly, we are able to demonstrate a significantly smaller uncertainty than that from previous regression methods. In addition, the quantile regression component allows us to model, in a more complete and flexible way than least squares, the conditional distribution of temperature given proxies. This relationship can be used to inform forward models relating how proxies are driven by temperature.
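    A minimal sketch of the quantile-regression ingredient on synthetic data, assuming the statsmodels library; this is plain quantile regression, not the paper's full method (in particular, the autoregressive residual structure is omitted), and the variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-ins for a temperature series and a single proxy.
rng = np.random.default_rng(0)
proxy = rng.normal(size=500)
temperature = 0.8 * proxy + rng.standard_t(df=5, size=500)

# Fit several conditional quantiles of temperature given the proxy,
# tracing out the conditional distribution rather than just its mean.
X = sm.add_constant(proxy)
for q in (0.05, 0.50, 0.95):
    fit = sm.QuantReg(temperature, X).fit(q=q)
    print(f"q={q:.2f}: intercept={fit.params[0]:+.3f}, "
          f"slope={fit.params[1]:+.3f}")
```

    Fitting low, median, and high quantiles jointly is what lets the conditional distribution of temperature given proxies be modeled more completely and flexibly than a least-squares fit of the conditional mean.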